ABSTRACT

Sequential probability ratio testing (SPRT), which usually is applied in situations
requiring a decision between two simple hypotheses or a single decision point, is
extended to include situations involving k decision points and [(k + 1)-choose-2] sets of
simultaneous, simple hypotheses, where k > 1. The multiple-decision point or
multiple-category SPRT procedure can be used to classify examinees into k + 1 categories
using computer adaptive methods. Computer simulations utilizing a 200-item pool of
previously calibrated test items show that the multiple-category SPRT method controls
misclassification error rates adequately, provided that the number of decision points
is not too large. Twelve figures are included. (Contains 7 references.) (Author)



Multiple-Category Classification Using a Sequential Probability Ratio Test 

Wald's (1947) sequential probability ratio testing (SPRT) procedure has been used with cognitive
tests to classify examinees into one of two categories (e.g., pass/fail, master/nonmaster,
certified/noncertified) (Reckase, 1983). In other words, this procedure is useful for determining
which of two states or conditions an examinee more likely belongs to: either the individual
has ability or latent trait greater than or equal to some minimum value, $\delta$, or that same
individual has ability less than the minimum value, $\delta$. The value $\delta$ is frequently
called a passing score or decision point.

One way to test the composite hypothesis that the examinee has latent ability less
than $\delta$ versus that the examinee has latent ability greater than or equal to $\delta$ is to
consider simple hypotheses, $H_0$ and $H_1$, regarding the unidimensional latent trait or ability
($\theta_i$) of the examinee taking the test. These simple hypotheses can be written as

$$H_0: \theta_i = \theta_0 \quad \text{vs.} \quad H_1: \theta_i = \theta_1,$$

where $\theta_i$ is an unknown parameter of the distribution of the dichotomous response to a
particular test item, $X$ (Silvey, 1975). Usually, $\theta_0$ and $\theta_1$ represent decision points
that correspond to lower and upper limits, respectively, of the passing criterion or threshold,
$\delta$, where $\theta_0 < \delta < \theta_1$. The SPRT can then be used to test the composite
hypotheses $H_0: \theta_i < \delta$ versus $H_1: \theta_i \ge \delta$ by considering two weaker
hypotheses, say $\omega_0 = \{\theta: \theta \le \theta_0\}$ and
$\omega_1 = \{\theta: \theta \ge \theta_1\}$ (Silvey, 1975; Wald, 1947).

In the case of cognitive testing, $X$ can be assumed to follow a binomial distribution. If
$P(\theta_i)$ is the probability that examinee $i$ responds correctly to an item, and
$Q(\theta_i) = 1 - P(\theta_i)$ is the probability of an incorrect response from examinee $i$,
then, for this single item, the random variable, $X$, represents a single Bernoulli trial and is
distributed as $\mathrm{Bin}\{P(\theta_i), 1\}$. Then,

$$\pi(\theta_i) = \mathrm{Prob}(X = x \mid \theta_i) = P(\theta_i)^{x}\, Q(\theta_i)^{1 - x},$$

where

$$x = \begin{cases} 1, & \text{correct response} \\ 0, & \text{incorrect response.} \end{cases}$$

For this test item, the probability of observing $X = x$ under the alternative hypothesis is
$\pi(\theta_1)$. Under the null hypothesis, the probability of observing $X = x$ is $\pi(\theta_0)$.
The functions, $\pi(\theta_1)$ and $\pi(\theta_0)$, are called likelihood functions of $x$, and a
ratio of these two functions, $L(x) = \pi(\theta_1)/\pi(\theta_0)$, is called a likelihood ratio.

Two error probabilities, $\alpha$ and $\beta$, can be defined, where

$$\mathrm{Prob}(\text{choosing } H_1 \text{ if } H_0 \text{ is true}) = \alpha$$

and

$$\mathrm{Prob}(\text{choosing } H_0 \text{ if } H_1 \text{ is true}) = \beta.$$

Wald (1947) stated that even though the nominal error rates, $\alpha$ and $\beta$, are established
prior to testing, the actual error rates observed in practice, $\alpha^*$ and $\beta^*$, are bounded
from above by functions of the nominal rates, or $\alpha^* \le \alpha/(1 - \beta)$ and
$\beta^* \le \beta/(1 - \alpha)$. Wald (1947) also defined two likelihood ratio boundaries that are
functions of $\alpha$ and $\beta$. These boundaries are $A$ and $B$, where the lower boundary is
$B \ge \beta/(1 - \alpha)$ and the upper boundary is $A \le (1 - \beta)/\alpha$.
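
As a rough numerical illustration, the two boundaries can be computed by taking Wald's inequalities at equality, which is the usual working approximation. The function name and the nominal rates of .10 are assumptions chosen for this sketch.

```python
def wald_boundaries(alpha, beta):
    """Approximate SPRT boundaries (B, A) from nominal error rates.

    Takes Wald's inequalities at equality: B = beta / (1 - alpha) and
    A = (1 - beta) / alpha.
    """
    if not (0.0 < alpha < 1.0 and 0.0 < beta < 1.0):
        raise ValueError("alpha and beta must lie strictly between 0 and 1")
    lower_b = beta / (1.0 - alpha)    # accept H0 when L <= B
    upper_a = (1.0 - beta) / alpha    # accept H1 when L >= A
    return lower_b, upper_a


# Illustrative nominal rates (assumed, not taken from the text above).
B, A = wald_boundaries(alpha=0.10, beta=0.10)
print(round(B, 4), round(A, 4))
```

With equal nominal rates the boundaries are reciprocals of each other, so the test is symmetric on the log scale.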

According to Wald's SPRT, item responses are observed in sequence, $x_1, x_2, \ldots, x_n$, and
following each observation, the likelihood ratio, $L(x_1, x_2, \ldots, x_n \mid \theta_0, \theta_1)$,
is computed, assuming conditional independence, where

$$L(x_1, x_2, \ldots, x_n \mid \theta_0, \theta_1) =
\frac{\pi_1(\theta_1)\, \pi_2(\theta_1) \cdots \pi_n(\theta_1)}
     {\pi_1(\theta_0)\, \pi_2(\theta_0) \cdots \pi_n(\theta_0)}.$$

The likelihood ratio is then compared to the boundaries, $A$ and $B$. If
$L(x_1, x_2, \ldots, x_n \mid \theta_0, \theta_1) \ge A$, then $H_1$ is accepted and the examinee
is classified as $\theta_i \ge \delta$. If $L(x_1, x_2, \ldots, x_n \mid \theta_0, \theta_1) \le B$,
then $H_0$ is accepted and the examinee is classified as $\theta_i < \delta$. If
$B < L(x_1, x_2, \ldots, x_n \mid \theta_0, \theta_1) < A$, no decision is made and another item
response must be observed if a decision is to be made with the specified error rates.
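
The sequential rule above can be sketched in a few lines. This is a minimal illustration, assuming the correct-response probabilities at $\theta_0$ and $\theta_1$ are already available for each administered item; the function name and arguments are hypothetical. It accumulates the log of the likelihood ratio, which is equivalent to the product form and numerically safer.

```python
import math


def sprt_classify(responses, p0, p1, alpha, beta):
    """Run a simple SPRT over scored responses (1 = correct, 0 = incorrect).

    p0[j] and p1[j] are the probabilities of a correct response to item j
    at theta_0 and theta_1. Returns (decision, items_used), where decision
    is "H0", "H1", or "continue" if no boundary was crossed.
    """
    log_a = math.log((1.0 - beta) / alpha)   # upper boundary, log A
    log_b = math.log(beta / (1.0 - alpha))   # lower boundary, log B
    log_lr = 0.0
    for n, (x, q0, q1) in enumerate(zip(responses, p0, p1), start=1):
        if x == 1:
            log_lr += math.log(q1 / q0)
        else:
            log_lr += math.log((1.0 - q1) / (1.0 - q0))
        if log_lr >= log_a:
            return "H1", n
        if log_lr <= log_b:
            return "H0", n
    return "continue", len(responses)


# Ten hypothetical items, each with P = .4 at theta_0 and P = .6 at theta_1.
print(sprt_classify([1] * 10, [0.4] * 10, [0.6] * 10, 0.10, 0.10))
```

A "continue" result after the last available item is the situation in which a forced classification would be needed.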

Any test administered with the SPRT procedure is, by its very nature, adaptive in that
examinees with different abilities (i.e., different values of $\theta_i$) could have different
expected test lengths, $n_i$, the number of items that must be administered before a classification
is made. Typically, those examinees with $\theta_i < \theta_0$ or $\theta_i > \theta_1$ will have
shorter expected test lengths than those with $\theta_0 < \theta_i < \theta_1$.

To facilitate the SPRT procedure for criterion-referenced testing, the value of $\delta$ usually
corresponds to a minimum proportion, $p(\delta)$, of $m$ items in the item pool that an examinee is
expected to answer correctly in order to be classified as $\theta_i \ge \delta$. If $p(\delta)$ is
known a priori, then $\delta$ can be found by solving for $\delta$ in the expression

$$p(\delta) = \frac{1}{m} \sum_{j=1}^{m} P_j(\delta).$$

The item functions, $P_j(\delta)$, are typically expressed as 3-parameter logistic item response
functions with known (i.e., calibrated) item parameters.
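
Because each 3-parameter logistic function increases with ability, the pool proportion is monotone and $\delta$ can be found by bisection. The sketch below assumes the common scaling constant D = 1.7 and uses three made-up items; the function names, the search interval, and the item parameters are all illustrative assumptions.

```python
import math


def p3pl(theta, a, b, c, D=1.7):
    """3-parameter logistic probability of a correct response."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


def pool_proportion(theta, items):
    """Expected proportion correct over the pool: (1/m) * sum of P_j(theta)."""
    return sum(p3pl(theta, a, b, c) for a, b, c in items) / len(items)


def solve_delta(target, items, lo=-4.0, hi=4.0, tol=1e-8):
    """Find delta with pool_proportion(delta) = target by bisection."""
    if not pool_proportion(lo, items) <= target <= pool_proportion(hi, items):
        raise ValueError("target proportion is outside the searchable range")
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if pool_proportion(mid, items) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


# Hypothetical (a, b, c) item parameters, not the calibrated pool.
items = [(1.0, 0.0, 0.20), (1.2, 0.5, 0.15), (0.8, -0.5, 0.25)]
delta = solve_delta(0.60, items)
print(round(delta, 3), round(pool_proportion(delta, items), 3))
```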

Values for $\theta_0$ and $\theta_1$ are selected according to the precision that is desired.
Values of $\theta_0$ and $\theta_1$ that are close to each other imply high precision, while greater
differences between $\theta_0$ and $\theta_1$ imply less precision. Normally, $\theta_0$ and
$\theta_1$ are selected to be equidistant from $\delta$, although this is not a necessary condition
for the SPRT procedure. The region from $\theta_0$ to $\theta_1$ is known as the indifference region
because there is usually an amount of indifference associated with the classification made for
individuals within that region. The distance, $|\theta_1 - \theta_0|$, is the width of the
indifference region. Test length is a function of this region; for fixed values of $\alpha$ and
$\beta$, a larger indifference region results in shorter expected test lengths for all examinees
(Reckase, 1983; Spray & Reckase, 1987).

Within the context of an adaptive test, the $m$ items in the item pool are usually ranked
1 through $m$ on the basis of item information at the decision point, $p(\delta)$ or equivalently,
$\delta$, and then administered in sequence to each examinee. Therefore, many examinees could
receive some of the same items as all other examinees taking the sequential test. Because this is
usually undesirable from a test security standpoint, some randomization scheme can be employed to
assure that item-exposure rates (i.e., the number of times that any item is presented to examinees)
are controlled.
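
The ranking step can be illustrated with the standard information function for a 3-parameter logistic item. This is a sketch only: it assumes the scaling constant D = 1.7, ignores exposure control, and uses hypothetical item parameters and function names.

```python
import math


def p3pl(theta, a, b, c, D=1.7):
    """3-parameter logistic probability of a correct response."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


def item_information(theta, a, b, c, D=1.7):
    """Item information for a 3PL item at ability theta."""
    p = p3pl(theta, a, b, c, D)
    return (D * a) ** 2 * ((1.0 - p) / p) * ((p - c) / (1.0 - c)) ** 2


def rank_items(delta, items):
    """Return item indices ordered from most to least informative at delta."""
    return sorted(
        range(len(items)),
        key=lambda j: item_information(delta, *items[j]),
        reverse=True,
    )


# Hypothetical (a, b, c) parameters; the third item is far too hard
# to be informative at a decision point of 0.
items = [(1.0, 0.0, 0.0), (2.0, 0.0, 0.0), (1.0, 3.0, 0.0)]
print(rank_items(0.0, items))
```

A randomization scheme for exposure control could, for example, draw at random among the few top-ranked items rather than always taking the first.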

In addition, there is usually some maximum number of test items, or maximum test length
(MTL), that, from a practical standpoint in terms of testing time, can be presented to a single
examinee. Frequently, a forced classification is made once this maximum number of items has
been reached and no classification under the likelihood ratio test has occurred. Typically, after
reaching this maximum test length, $\log\{L(x_1, x_2, \ldots, x_{\mathrm{MTL}})\}$ is compared to
the log of the SPRT boundaries, $A$ and $B$. Classification is then made according to some distance
rule, for example by
$\mathrm{MIN}\{\,|\log L(x_1, x_2, \ldots, x_{\mathrm{MTL}}) - \log A|,\;
|\log L(x_1, x_2, \ldots, x_{\mathrm{MTL}}) - \log B|\,\}$. For tests where MTL is
fairly small, forced classifications can occur for many examinees. The effect of forced
classification on the SPRT procedure is to alter the actual classification error rates, $\alpha^*$
and $\beta^*$, reducing the classification accuracy.
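
The distance rule for a forced classification can be written directly. The sketch below assumes boundaries taken at Wald's approximations and, as an arbitrary choice for this example, resolves an exact tie in favor of $H_0$.

```python
import math


def forced_classification(log_lr, alpha, beta):
    """Classify by the nearer log boundary once MTL items have been given.

    log_lr is the log likelihood ratio after the last item. Ties go to
    "H0", which is an assumption of this sketch.
    """
    log_a = math.log((1.0 - beta) / alpha)
    log_b = math.log(beta / (1.0 - alpha))
    if abs(log_lr - log_a) < abs(log_lr - log_b):
        return "H1"
    return "H0"


print(forced_classification(0.5, 0.10, 0.10))
print(forced_classification(-0.5, 0.10, 0.10))
```

When $\alpha = \beta$, the two log boundaries are symmetric about zero, so the rule reduces to the sign of the log likelihood ratio.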

Multiple Categories 

When a testing situation requires classification into more than two categories, such as into
one of a number of entry level courses, a modified SPRT procedure can be used (Wetherill,
1975). The purpose of this paper is to describe one such multiple-category modification and to
report on the results of computer-simulated SPRTs requiring multiple classifications.
A Sequential Probability Ratio Test Involving Two Decision Points 

Suppose that the purpose of an SPRT is to classify an examinee into one of $k + 1$ categories
(e.g., hierarchically ordered mathematics courses), where $k$ is the number of decision points
required. For the following discussion, it is assumed that $k = 2$. The three categories of possible
mutually exclusive classification are $\theta_i < \delta_1$, $\delta_1 \le \theta_i < \delta_2$, or
$\theta_i \ge \delta_2$. The values of $\delta_1$ and $\delta_2$ are established or known a priori.
However, because the usual SPRT tests hypotheses about single values of $\theta$, defined by the
endpoints of the indifference region, such a region must be constructed around each decision point.
One such endpoint can be chosen midway between $\delta_1$ and $\delta_2$, or
($\delta_1$ + MIDIST), where MIDIST = $(\delta_2 - \delta_1)/2$. This $\theta$ value is labeled
$\theta_2$, while another $\theta$ value ($\theta_1$) can be chosen such that
$\theta_1 = \delta_1 -$ MIDIST. This result gives an indifference region around $\delta_1$ of size
$|\theta_2 - \theta_1| = 2 \times$ MIDIST. A similar indifference region can be constructed
around $\delta_2$ using $\theta_2$ and $\theta_3$ as indifference region endpoints, where
$\theta_3 = \delta_2 +$ MIDIST. These three values of $\theta$ form the set
$\{\theta_1, \theta_2, \theta_3\}$, where $\theta_1 < \theta_2 < \theta_3$. Once these values of
$\theta$ are established, three sets of SPRT hypotheses can be formulated:

$$H: \theta_i = \theta_1 \quad \text{versus} \quad H: \theta_i = \theta_2,$$
$$H: \theta_i = \theta_1 \quad \text{versus} \quad H: \theta_i = \theta_3,$$
$$H: \theta_i = \theta_2 \quad \text{versus} \quad H: \theta_i = \theta_3.$$

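All three sets of hypotheses are tested after each item response is obtained, and the
following decisions are made, based on the results of these tests:

Decision 1 is made ($\theta_i < \delta_1$) when $\theta_i = \theta_1$ is accepted in both tests
that involve $\theta_1$.

Decision 2 is made ($\delta_1 \le \theta_i < \delta_2$) when $\theta_i = \theta_2$ is accepted in
both tests that involve $\theta_2$.

Decision 3 is made ($\theta_i \ge \delta_2$) when $\theta_i = \theta_3$ is accepted in both tests
that involve $\theta_3$.

Otherwise, testing continues.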

For each SPRT, test items can no longer be ranked for sequential administration by item
information at a single decision point because there is more than one such point. A reasonable
compromise is to rank items by item information at the decision point that is closest to an
estimate of the examinee's ability based upon the responses to previous items. For this study,
a Bayes estimate of $\theta_i$ is obtained (Owen, 1975) for each examinee after each item response,
and the viable test items remaining in the pool are then ranked, by item information, at this
decision point and administered. The process continues until a decision is reached for each
examinee.

Establishing error rates. In order to test the set of three hypotheses given above, desired
error rates must be provided. These are used to derive the critical values for the likelihood ratio
tests. Let $p_{h|j}$ designate the probability that $\theta = \theta_h$ is accepted, given that
$\theta = \theta_j$ is correct, $h = 1, 2, 3$; $j = 1, 2, 3$. The power of any single SPRT is
$p_{h|h}$ or $p_{j|j}$, and for simplicity, let $p_{h|h} = p_{j|j}$ for all $h$ and $j$. It makes
intuitive sense to allow the error rates, $p_{h|j}$, $h \ne j$, to vary as a function of the
distance between $\theta_h$ and $\theta_j$. Specifically, the desired error rate should be less
when the distance between $\theta_h$ and $\theta_j$ is greater. If $d_{hj}$ represents the distance
(i.e., the absolute difference) between $\theta_h$ and $\theta_j$, $h \ne j$, then
$|D_h| = \sum 1/d_{hj}$, summed over $j$, represents the norm of these distances. Then a possible
set of error rates with these properties are
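
One allocation consistent with this description, offered strictly as an assumption rather than as the authors' expression, splits the total error $1 - p_{j|j}$ among the wrong hypotheses in proportion to the inverse distances, $p_{h|j} = (1 - p_{j|j})\,(1/d_{hj}) / \sum_{h' \ne j} 1/d_{h'j}$. Because $d_{hj}$ is symmetric, the normalizing sum is the same inverse-distance norm described above. The function name and the $\theta$ values are illustrative.

```python
def error_rate_matrix(thetas, power):
    """Build p[h][j] = P(accept theta_h | theta_j is true).

    Assumed form: the diagonal equals the power, and the remaining
    1 - power in each column is shared in proportion to 1 / d_hj.
    """
    n = len(thetas)
    p = [[0.0] * n for _ in range(n)]
    for j in range(n):
        inverse = {h: 1.0 / abs(thetas[h] - thetas[j])
                   for h in range(n) if h != j}
        norm = sum(inverse.values())
        for h in range(n):
            if h == j:
                p[h][j] = power
            else:
                p[h][j] = (1.0 - power) * inverse[h] / norm
    return p


# Hypothetical theta values for two decision points at -1.05 and 1.05.
for row in error_rate_matrix([-2.1, 0.0, 2.1], power=0.90):
    print([round(value, 4) for value in row])
```

Under this assumed form each column sums to one, and the error assigned to a distant hypothesis is smaller than the error assigned to an adjacent one.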

Establishing likelihood ratio boundaries. The likelihood ratio boundaries used to make
one of the three decisions mentioned above follow straightforwardly from the simple SPRT
procedure involving two categories. In order to test $H_h: \theta = \theta_h$ versus
$H_j: \theta = \theta_j$, the upper boundary is $p_{j|j}/p_{j|h}$ and the lower boundary is
$p_{h|j}/p_{h|h}$, $h = 1, 2, 3$; $j = 1, 2, 3$; $h \ne j$.
In particular, for the case involving two decision points and three categories,

if $L(x_1, x_2, \ldots, x_n \mid \theta_1, \theta_2) \le p_{1|2}/p_{1|1}$ and
$L(x_1, x_2, \ldots, x_n \mid \theta_1, \theta_3) \le p_{1|3}/p_{1|1}$, Decision 1
($\theta_i < \delta_1$) is made;

if $L(x_1, x_2, \ldots, x_n \mid \theta_2, \theta_3) \le p_{2|3}/p_{2|2}$ and
$L(x_1, x_2, \ldots, x_n \mid \theta_1, \theta_2) \ge p_{2|2}/p_{2|1}$, Decision 2
($\delta_1 \le \theta_i < \delta_2$) is made; and

if $L(x_1, x_2, \ldots, x_n \mid \theta_1, \theta_3) \ge p_{3|3}/p_{3|1}$ and
$L(x_1, x_2, \ldots, x_n \mid \theta_2, \theta_3) \ge p_{3|3}/p_{3|2}$, Decision 3
($\theta_i \ge \delta_2$) is made.
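
The pairwise boundaries can be computed from a table of $p_{h|j}$ values as follows. The matrix in this sketch is hypothetical (power .90, with the remaining error split by inverse distance for equally spaced $\theta$ values), and indices are 0-based, so index 0 stands for $\theta_1$.

```python
import math


def pair_log_boundaries(p, h, j):
    """Log boundaries for testing theta_h against theta_j (h < j).

    p[a][b] is the probability of accepting theta_a when theta_b is true.
    Returns (log_lower, log_upper): at or below log_lower accept theta_h,
    at or above log_upper accept theta_j.
    """
    log_upper = math.log(p[j][j] / p[j][h])
    log_lower = math.log(p[h][j] / p[h][h])
    return log_lower, log_upper


# Hypothetical error-rate table; each column sums to 1.
p = [
    [0.90, 0.05, 1.0 / 30.0],
    [1.0 / 15.0, 0.90, 1.0 / 15.0],
    [1.0 / 30.0, 0.05, 0.90],
]
for h, j in [(0, 1), (0, 2), (1, 2)]:
    lower, upper = pair_log_boundaries(p, h, j)
    print(h + 1, j + 1, round(math.exp(lower), 4), round(math.exp(upper), 4))
```

In this example the test between the two most distant $\theta$ values has the widest boundaries, reflecting the smaller error rate assigned to that pair.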

A Sequential Probability Ratio Test Involving k Decision Points 

In general, suppose that the purpose of an SPRT is to classify an examinee into one of
$k + 1$ categories, where $k$ is the number of decision points required. For the following
discussion, it is assumed that $k > 2$. The $k + 1$ categories of mutually exclusive classification
are $\theta_i < \delta_1$, $\delta_1 \le \theta_i < \delta_2$, $\delta_2 \le \theta_i < \delta_3$,
..., $\theta_i \ge \delta_k$. Once again, the values of $\delta_1$, $\delta_2$, $\delta_3$, etc.
are established a priori. These might represent the criteria for receiving class grades A, B, C,
and so on. In order to perform the necessary SPRTs, $k + 1$ values of $\theta$ must be established
in the manner described previously (i.e., the values of $\theta$ represent midpoints between
adjoining decision points). These $\theta$ values are used to test [($k$ + 1)-choose-2] simple SPRTs
of the form $\theta_1$ versus $\theta_2$, $\theta_1$ versus $\theta_3$, ..., $\theta_1$ versus
$\theta_{k+1}$, $\theta_2$ versus $\theta_3$, ..., $\theta_k$ versus $\theta_{k+1}$. Error rates,
$p_{h|j}$, and likelihood ratio boundaries remain the same as described previously.
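
A short sketch of this construction follows. The interior $\theta$ values are the midpoints between adjoining decision points; for the two outer values the sketch assumes the half-width of the nearest gap, which matches the two-decision-point case but is an assumption when the decision points are unequally spaced. Function names are illustrative.

```python
from itertools import combinations


def theta_points(deltas):
    """Return the k + 1 theta values for k >= 2 ordered decision points."""
    if len(deltas) < 2 or any(b <= a for a, b in zip(deltas, deltas[1:])):
        raise ValueError("need at least two strictly increasing decision points")
    mids = [(a + b) / 2.0 for a, b in zip(deltas, deltas[1:])]
    first = deltas[0] - (mids[0] - deltas[0])      # delta_1 - MIDIST
    last = deltas[-1] + (deltas[-1] - mids[-1])    # delta_k + MIDIST
    return [first] + mids + [last]


def hypothesis_pairs(thetas):
    """All (k + 1)-choose-2 index pairs (h, j) with h < j."""
    return list(combinations(range(len(thetas)), 2))


thetas = theta_points([-0.95, 0.05, 1.05])
print([round(t, 2) for t in thetas], len(hypothesis_pairs(thetas)))
```

The number of pairwise tests grows quadratically with the number of decision points, from 3 tests at $k = 2$ to 10 tests at $k = 4$.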

The number of tests necessary for acceptance before a decision is made (i.e., before an
examinee is classified into one of the $k + 1$ categories) is $k$. As before, if item administration
is terminated before a classification is made (e.g., MTL is reached), then the region of
classification containing an estimate of the examinee's ability, $\theta_i$, can be obtained and
used to place the examinee into one of the $k + 1$ categories. The same situation applies in the
$k > 2$ case as it did in the 2-category case, in terms of the effect of forced classification on
the SPRT procedure and the actual error rates. Classification into categories is not as accurate
when item administration ends when the MTL value is reached.

All [($k$ + 1)-choose-2] sets of hypotheses are tested, and the following decisions are made,
based on the results of these tests:

Decision 1 is made ($\theta_i < \delta_1$) when the $k$ tests of $H: \theta_i = \theta_1$ are
accepted.

Decision $j$ is made ($\delta_{j-1} \le \theta_i < \delta_j$) when the $k$ tests of
$H: \theta_i = \theta_j$, $j = 2, 3, \ldots, k$, are accepted;

Decision $k + 1$ is made ($\theta_i \ge \delta_k$) when the $k$ tests of
$H: \theta_i = \theta_{k+1}$ are accepted.
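
The general decision rule can be sketched as a count of accepted tests per $\theta$ value. The sketch assumes the log-likelihood of the responses so far has already been evaluated at each $\theta$ value and that a table of $p_{h|j}$ values is supplied; the function name, the 0-based category index, and the example table are assumptions of this illustration.

```python
import math
from itertools import combinations


def classify(log_liks, p):
    """Return the 0-based category index, or None to continue testing.

    log_liks[c] is the log-likelihood of the observed responses at theta_c.
    p[a][b] is the probability of accepting theta_a when theta_b is true.
    A category is chosen only when its theta value is accepted in all k
    pairwise tests that involve it.
    """
    n_points = len(log_liks)                 # k + 1 theta values
    accepted = [0] * n_points
    for h, j in combinations(range(n_points), 2):
        log_lr = log_liks[j] - log_liks[h]   # theta_j in the numerator
        if log_lr >= math.log(p[j][j] / p[j][h]):
            accepted[j] += 1
        elif log_lr <= math.log(p[h][j] / p[h][h]):
            accepted[h] += 1
    for category, wins in enumerate(accepted):
        if wins == n_points - 1:
            return category
    return None


# Hypothetical error-rate table for three theta values (power .90).
p = [
    [0.90, 0.05, 1.0 / 30.0],
    [1.0 / 15.0, 0.90, 1.0 / 15.0],
    [1.0 / 30.0, 0.05, 0.90],
]
print(classify([-5.0, 0.0, -5.0], p))
```

Because each pairwise test accepts at most one of its two hypotheses, no more than one category can collect all $k$ acceptances.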
Results of Computer Simulations 

Computer simulations were conducted to determine if the multiple-category SPRT
procedure produced classifications that were characteristic of a simple, one-decision-point SPRT.
In other words, did the multiple-category SPRT produce classification error rates and average test
lengths that were greatest at the decision points? Would the error rate appear to be controlled
appropriately by the specified power, and if so, by what amount?

A calibrated 200-item pool was used to simulate multiple-category SPRT classifications
via computer. Items were calibrated with the BILOG computer program (Mislevy & Bock,
1984). Mean estimates of the a-, b-, and c-parameters for the item pool were 1.18, .48, and .16,
respectively. Four computer simulations were performed. Simulation I (the simple SPRT)
required a single decision point, $k = 1$ {$\delta = .05$, or $p(\delta) = .43$}, with 3 sizes of
the indifference region: (-.20, .30), (-.45, .55), and (-.95, 1.05), with power (i.e.,
$p_{h|h} = p_{j|j}$) = .90. Simulation II consisted of 2 decision points of $\delta_1 = -1.05$ and
$\delta_2 = 1.05$, or $p(\delta_1) = .23$ and $p(\delta_2) = .75$, respectively, again with .90
power.



Simulation III consisted of 3 decision points at -.95, 0.05, and 1.05, or $p(\delta_1) = .24$,
$p(\delta_2) = .43$, and $p(\delta_3) = .74$, also with power = .90. Finally, Simulation IV
required 4 decision points at -1.05, -.55, .55, and 1.05, or $p(\delta_l) = .23$, .31, .58, and
.75, $l = 1, 2, 3, 4$, with .90 power. For each simulation, 3 different values of MTL, the maximum
test length or maximum number of items to be administered before a forced classification was made,
were used. These were 10, 20, and 50. For any single set of simulation conditions, a sequential
test was administered 100 times to an examinee with known ability, $\theta_i$, where $\theta_i$
varied systematically from -3.0 to +3.0 in increments of .25.

Two outcome measures were tabulated over each set of 100 replications. Classification
Error Rate (CE Rate) was the number of times that a simulated examinee with a known ability,
$\theta_i$, was misclassified, either before MTL items were presented or after MTL items were
administered and a forced classification was made. Average Test Length (ATL) was the average
number of test items administered before an examinee was classified.
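
The two outcome measures can be tabulated from a set of replications as sketched below. The sketch expresses CE Rate as a proportion of replications, which is an assumption about scaling, and uses hypothetical function names and made-up replication results.

```python
from bisect import bisect_right


def true_category(theta, deltas):
    """0-based category of a known ability, with theta >= delta moving up."""
    return bisect_right(deltas, theta)


def summarize(results, correct_category):
    """Return (CE rate, ATL) for a list of (category, items_used) results."""
    n = len(results)
    errors = sum(1 for category, _ in results if category != correct_category)
    average_length = sum(length for _, length in results) / n
    return errors / n, average_length


# Four made-up replications for an examinee whose true category is 0.
deltas = [-1.05, 1.05]
results = [(0, 5), (1, 7), (0, 6), (0, 10)]
print(summarize(results, true_category(-2.0, deltas)))
```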

Simulation I. Figures 1, 2, and 3 show CE Rates for Simulation I ($k = 1$) for three sizes
of the indifference region, (-.20, .30), (-.45, .55), and (-.95, 1.05), respectively, for
the three values of MTL. Figures 4, 5, and 6 show ATL for the same conditions, also
respectively.

CE Rate peaked at or near the single decision point, $\delta = 0.05$, regardless of the value of
MTL (see Figures 1, 2, and 3). Classification error was slightly greater for the largest
indifference region and was also greater for lower values of $\theta$ (see Figure 3). For all three
indifference regions, CE Rate decreased as MTL increased. The ATL function reached a peak
at or near $\delta = .05$. As expected, values of ATL increased when MTL increased and when the
width of the indifference region decreased. Slightly elevated ATL levels in the upper ability
region of $\theta$ were noted under all conditions (see Figures 4, 5, and 6).

Simulation II. Figures 7 and 8 show Classification Error Rate and ATL for Simulation
II ($k = 2$) for the three values of MTL. The decision points were $\delta_1 = -1.05$ and
$\delta_2 = 1.05$, or $p(\delta_1) = .23$ and $p(\delta_2) = .74$. The errors once again tended to
peak at or near the two decision points and were minimized in the tails and in between the two
decision points. There was a tendency for the error to be higher at the lower decision point,
$\delta_1$. Values of the ATL also reached maximums at or near the decision points, although there
were some exceptions for very low values of $\theta$.

Simulation III. For Simulation III ($k = 3$), Figures 9 and 10 show CE Rate and ATL,
again for the three values of MTL. These figures are consistent with the $k = 2$ situation, in that
CE Rate and ATL reached maximums at the three decision points. Once again, the error at the
lowest decision point, $\delta_1$, tended to be slightly higher than at the remaining two decision
points. Misclassification was greatest for the shortest test (i.e., when MTL = 10). The ATL peaked
dramatically at the lowest decision point for MTL = 50, and, to a lesser extent, when MTL = 20.
The average length of the test increased considerably with the added decision point (see Figure
10 versus Figure 8).

Simulation IV. Figures 11 and 12 show Classification Error Rate and ATL for this
simulation condition. The error plot shows the familiar patterns in which the greatest
misclassification occurred at the three decision points. The ATL was greatest at the two lowest
decision points but peaked again at $\delta_3$ and $\delta_4$, as expected.
Comparisons across simulations

In order to compare the two outcome measures better across simulations, expected values
of CE and ATL were computed by assuming that $\theta_i$ was distributed as $N(0, 1)$. Note that
this assumption was not necessary in order to conduct the multiple-category SPRT simulations. The
results appear in Table 1. This table shows that classification error rate and average test length
usually increased with the number of decision points. The exception was the $k = 2$ case where
MTL = 20 and 50.
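
The text does not say how these expectations were computed. One plausible reading, assumed here, weights the outcome at each simulated ability by the standard normal density at that point and renormalizes the weights over the grid from -3.0 to +3.0. Function names are illustrative.

```python
import math


def theta_grid(lo=-3.0, hi=3.0, step=0.25):
    """Ability values from lo to hi inclusive in equal steps."""
    n = int(round((hi - lo) / step)) + 1
    return [lo + i * step for i in range(n)]


def expected_value(values, grid):
    """Weighted mean of values using normalized N(0, 1) density weights."""
    weights = [math.exp(-0.5 * t * t) for t in grid]
    total = sum(weights)
    return sum(w * v for w, v in zip(weights, values)) / total


grid = theta_grid()
# Hypothetical error-rate curve peaking at a decision point of 0.05.
ce_curve = [0.4 * math.exp(-((t - 0.05) ** 2) / 0.1) for t in grid]
print(len(grid), round(expected_value(ce_curve, grid), 3))
```

Under this weighting, error peaks at decision points near the center of the ability distribution contribute far more to the expected CE Rate than peaks in the tails.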

Summary and Conclusions 

The extension of the SPRT procedure to multiple decision points for classification appears
to work as expected. Error rates appeared to be controlled, for the most part, for values of
$\theta$ away from the decision points in a manner similar to the $k = 1$ case or simple SPRT.
Recall that in the simple case, the SPRT procedure guarantees that classification errors,
$\alpha^* + \beta^*$, will be bounded by functions of $\alpha$ and $\beta$. By specifying power
a priori, the classification error rate is controlled for $k = 1$. Likewise, it would appear that
specifying power also controls the classification errors in the multiple-category situation.
However, it is obvious from these results that, as $k$ increases, the number of items required to
meet the specified classification error rates also increases. In a practical testing situation,
these large numbers of items may not be practical to administer. Thus, the multiple-category SPRT
extension may have limited benefits beyond use with a relatively small number of decision points.
TABLE 1
Expected CE Rate and ATL

k                                    MTL    E(CE Rate)    E(ATL)

1 (largest indifference region)       10       .077         1.883
                                      20       .074         1.879
                                      50       .073         1.880

1 (medium indifference region)        10       .044         3.319
                                      20       .041
                                      50       .040         3.464

1 (smallest indifference region)      10       .040         5.404
                                      20       .029         6.075
                                      50       .027         6.676

2                                     10       .127         4.580
                                      20       .119         4.758
                                      50       .126         5.012

3                                     10       .163         7.568
                                      20       .110         9.577
                                      50       .103        10.767

4                                     10       .209         9.546
                                      20       .146        15.048
                                      50       .127        22.546
Figure Captions

Figure 1. Classification Error Rate for k = 1, Smallest Indifference Region: (-.20, .30)

Figure 2. Classification Error Rate for k = 1, Medium Indifference Region: (-.45, .55)

Figure 3. Classification Error Rate for k = 1, Largest Indifference Region: (-.95, 1.05)

Figure 4. ATL for k = 1, Smallest Indifference Region: (-.20, .30)

Figure 5. ATL for k = 1, Medium Indifference Region: (-.45, .55)

Figure 6. ATL for k = 1, Largest Indifference Region: (-.95, 1.05)

Figure 7. Classification Error Rate for k = 2

Figure 8. ATL for k = 2

Figure 9. Classification Error Rate for k = 3

Figure 10. ATL for k = 3

Figure 11. Classification Error Rate for k = 4

Figure 12. ATL for k = 4